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Abstract 

Within the preprocessing pipeline of a Next Generation Sequencing sample, its set of Single-Base Mis- 
matches is one of the first outcomes, together with the number of correctly aligned reads. The union of 
these two sets provides a 4 x 4 matrix (called Single Base Indicator, SBI in what follows) representing a 
blueprint of the sample and its preprocessing ingredients such as the sequencer, the alignment software, 
the pipeline parameters. In this note we show that, under the same technological conditions, there is a 
strong relation between the SBI and the biological nature of the sample. To reach this goal we need to 
introduce a similarity measure between SBIs: we also show how two measures commonly used in machine 
learning can be of help in this context. 

Introduction 

A first measure of the goodness of alignment of a Next Generation Sequencing (NGS) sample is given on 
one side by the number of correctly aligned reads, and on the other side on the number of wrongly detected 
bases, collected in the single-base mismatches count. All these values are among the basic outcomes of 
the sample preprocessing, regardless of the ingredients of the used pipeline. These information can be 
grouped together in a 4 x 4 matrix of integer values indexed by the four bases A,C,G,T, where the entry 
in position X,Y is the number of true bases (in the reference genome) X interpreted as Y. In what follows, 
we call this matrix the Single-Base Indicator, SBI for short, of the studied sample. 

Obviously, the SBI is depending not only on the sample, but also on the adopted pipeline: thus, 
the sources of variabilities possibily affecting the final outcome are several. In fact, different alignment 
performance have been assessed for different software (Bowtie and BWA only to mention the two most 
popular ones), the preprocessing (with separated or unified lanes) and filtering procedures applied, and 
the sequencing platform. Thus the SBI can be seen as (a sort of) signature for the investigated sample 
with respect to the employed pipeline configuration, with the aim of using it to quantitatively formalize 
differences; for example, to be used as a baseline reference for the noise level that may be expected for a 
given configuration and thus, in some sense, for the stability of the configuration itself mimicking what 
has been carried out in [I]. To reach this goal, a similarity measure for SBI is needed: in this paper, we 
propose to use concepts from Statistical Machine Learning to deal with this problem, and namely the 
theory of classifier performance comparison. 

In fact, the SBI can be seen as the confusion matrix of a classification problem on the four classes 
A,G,C,T, where the classifier is the alignment pipeline and the ground truth is the reference genome. 
Within this framework, several measures are available in literature for translating a confusion matrix 
into a single performance value: see for instance PHI] for a review of the more classical methods, or [S] 
for novel measures or [5HH] for totally different approaches. Among others, the Matthews Correlation 
Coefficient (MCC, for short) has recently attracted the attention of many researchers in the field: in 
particular, it has been designed as the elective measure in initiatives defining methodologic guidelines 
such as MAQC-II [S] led by the U.S. Food and Drug Administration (FDA). Originally introduced for 
binary problems [TU] , its generalization to the multiclass case was defined in [TT] and since then used in a 
wide range of tasks. Here the (generalized) MCC is used to summarize in a unique real value the quality 
of the alignment of a sample: the range of the values is [—1, 1] and the higher the value the better the 
alignment, with 1 representing the ideal situation of no mismatch occurring. 
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In addition to the comparison with a common ideal situation, it is worthwhile to consider the mu- 
tual similarities and differences between SBIs: we perform this by looking at each SBI as a (nominal) 
histogram, and use a suitable distance. Among the many available distances [121113] . we chose to use 
the Canberra distance [13]. This is motivated by the fact that, being defined as the ratio between the 
(absolute value) of the difference and the sum of two quantities, the Canberra distance is intrinsically 
more susceptible to even slight changes around zero. This case is indeed the one we are interested in, 
since when considering the (frequency) histogram corresponding to a SBI, the values inherited by the 
single-base mismatches are very close to zero because of the preponderance of the occurrences of the 
exact matches. 

As an application of these two similarity measures, we show the evaluation of the alignment of a NGS 
dataset where all the components of the pipeline are fixed, yielding the tissue type as the sole controlled 
source of variability. The analysis with both MCC and Canberra distance leads to the hypothesis that 
there is a strong relation between the tissue type and the corresponding SBI structures, thus representing 
a first promising step towards the development of a stability theory for NGS data through the SBI. 

Materials and Methods 

Deep sequencing data 

A NGS data set, denoted here as "BMl", was used as testbed for evaluating the proposed method. 
The data set, previously described in [15] and available on EMBL-EBI European Nucleotide Archive 
(ENA) with accession number SRP000727, consists of deep transcriptome sequencing (RNA-seq) of 11 
human tissues and 6 cell lines. The biological samples were sequenced on the lUumina Genome Analyzer 
(Illumina, Inc., San Diego, CA, USA), obtaining over 400 million short (32 bp) reads. In case a biological 
sample was sequenced over multiple lanes, in the following we refer to each single lane as a sample. The 
data considered here include a total of 103 samples (lanes) distributed as in Table [1] 

Alignment and postprocessing 

The short reads were mapped to the reference human genome (UCSC assembly hgl8, NCBI Build 
GRCh36.1), including unordered sequences and alternate haplotypes, with either BWA 0.5.7 [16] (align- 
ment Al), bowtie 0.12.5 [17] (alignment A2), and TopHat 1.1.4 [T^ (alignment A3). The mapping was 
performed allowing at most two nucleotide differences (i.e., mismatches or gaps) over the read length. 

For each tissue or cell line and for each alignment method, a consensus sequence of the whole genome 
(i.e., a genetic sequence of mapped regions, including variants) was called from read mappings using 
SAMtools 0.1.7a [19] with the model implemented in MAQ [20]. Small variants (i.e., single-base mis- 
matches) were subsequently called from the consensus sequences and filtered according to read coverage 
and mapping quality constraints. In detail, only variants with a mapping Phred quality score larger than 
30 and covered by at least 3 (to prevent spurious calls) and at most 100 reads (to avoid calls in regions 
with very high depth) were kept, following guidelines commonly adopted in the literature [21[|22| and 
suggested by Illumina, Inc. for small variant discovery [23]. Finally, we computed an additional "virtual 
sample" by merging the alignments of different lanes, when present. 

Single Base Indicator 

A typical outcome of any alignment software is a summary of the exact matches and the occurring single- 
base mismatches, and these information may appear under very different shapes. For our purposes, 
we are interested in collecting (for each sample, or lane s) the 16 integer values X > Y counting the 
number of times the base X in the reference genome has been read as Y by the alignemnt pipeline, where 
X,Y € B = {A,C,G,T}. The 16 values are then organized in a square matrix SBI(s) G M{\B^\,N), 



3 



called Single Base Indicator (SBI), or as a nominal (i.e., with no ordering among the 16 categories X > Y) 
histogram, possibily normalized over the total number of counts TC = SBI(s),jj. An example of two 

SBI histograms is shown in Figure[TJ respectively for the two samples SRR015311 (skeletal muscle tissue) 
and SRR015286 (mixed brain tissue) in BMl. 



Distance from perfect alignment 

Starting from the SBI matrix, define two matrices X,Y E A^(TC x F2) where Xgn = 1 if the base s 
is predicted to be of class n (pc(s) = n) and X^n = otherwise, and Kj„ = 1 if base s belongs to class n 
(tc(s) = n) and otherwise. Using Kronecker's delta function, the definition becomes: 

X = ((5pc(.),„),„ Y = ((5tc(.),„)^„ . 

Then the Matthews Correlation Coefficient MCC can be defined as the ratio: 

cov{X, Y) 



MCC = 



y/cov{x,x) ■ cov(r,r) 



where cov(-, •) is the covariance function. In terms of the matrix SBI, the above equation can be written 
as: 
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MCC lives in the range [—1, 1], where 1 is perfect classification, —1 is reached in the alternative extreme 
misclassification case of a confusion matrix with all zeros but in two symmetric entries SBI^j, SBIj_j, and 
when the confusion matrix is all zeros but for one single column (all samples have been classified to be 
of a class k), or when all entries are equal SBI^j = K G N. Note that MCC is invariant with respect to 
multiplication of all SBFs entries by a constant. 



Canberra distance between SBIs 

Given the histogram (normalized over the respective TC) of the SBI for two samples s, t it is possible to 
define their Canberra distance as follows: 



Ca(s,i)= E 



|SBI(s)y - SBI(t),j 
SBI(s),j + SBI(t)„- 



Because of the shape of the denominator, variations on the mismatch categories tend to get higher weight 
in the sum than the exact matches, thus highlighting the differences on the wrongly assigned bases of the 
two compared samples. 



Sammon's mapping 

The projection of SBIs into a two-dimensional space was carried out by the nonlinear scaling algorithm 
proposed by Sammon [53], using Ca(s, t) as the distance function. The mapping is achieved by minimizing 
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the Sammon stress function, which in our terms is defined as 



E = 



E.^tCa*(s,i) 



1 



E 



Ca*(s,0 



where Ca*(s, t) is the Canberra distance of samples s, t in the original space and Ca(s, t) is the Canberra 
distance in the projected space. 



The first part of the analysis consists in computing the MCC values for all the elements of the data set. 
In order to speed up computations and to ease data handling, the SBI matrices are loaded in a portable 
SQLite database. To better highlight differences, in Figures [5] (alignment Al), 2] (alignment A2) andlH 
(alignment A3) we ordered the 123 samples (including 24 virtual samples) of data set BMl for decreasing 
MCC and displayed on the y axes the quantity 10^(1 — MCC); thus, the smaller the numbers (leftmost 
samples), the closer the SBI to the perfect alignment case. The main observation here is that samples 
belonging to the same class tend to have very similar MCC values, and thus to group together in the plot: 
in Tables [5J [3] and 13] we list the rank of all BMl samples grouped by class and sorted in decreasing MCC 
order, for alignments Al, A2 and A3. The samples of the brain class are the best aligned in average, 
with mixed brain (whose tissue is biogically close to brain tissue) ranking first for Al and A2. On the 
other hand, almost all the merged virtual samples have the worst alignment quality in terms of number 
of single-base mismatches. 

These claims are furthermore supported by the dendrogram in Figure [H where the mutual 123 Can- 
berra distances among SBI histograms from alignment Al are hierarchically clustered together with 
complete linkage: Table [5] summarizes the structure of the clusters resulting by cutting the dendrogram 
at height 2. Similar results are reported for alignments A2 (Fig. [5l Tab. |6]) and A3 (Fig. [71 Tab. Uil- 
Again, for many classes, samples in the same class are consistently grouped together, with the merged 
lanes virtual samples forming a separate entity. 

In both analyses, samples of the same class tend to quantitatively show an intrinsinc similarity. This 
result seems to support our claim that the SBI mismatch profile associated to a sample is strongly 
dependent on its tissue type, when all other sources of variability within the preprocessing pipeline are 
kept stable. 

Sammon's mapping of all BMl samples using Canberra distance between SBIs (Figures[8l[9| [T0l) shows 
how technical replicates form well-defined clusters stratified by tissue types, while a separate trend exists 
for the biological replicates (cerebellum samples), as expected. 

Conclusions 

We propose candidate indicators for profiling RNA-seq experiments based on generalized MCC and 
Canberra distance. The goal is to assess whether variability in RNA-Seq experiments depends on factors 
that resemble the analogous of "batch effects" for microarrays. 

Indicators are built on top of single-base mismatches, which are a first direct measurement of the good- 
ness of alignment. Results show that the proposed indicators make it possible to identify (sub)structures 
of data determined by underlying experimental characteristics such as tissue types, paving the way for 
a normalization method for NGS that could guarantee the reproducibility of results, as being nowadays 
explored by initiatives such as the Sequencing Quality Control (SEQC) led by U.S. FDA. 

The study is currently being extended to consider multiple NGS platforms, such as lUumina, Helicos, 



Results 



and SOLID. 
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Figure Legends 
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Figure 1. Alignment Al: histogram of the SBI for the two BMl samples SRR015311 (skeletal 
muscle tissue, in white) and SRR015286 (mixed brain tissue, in black) with different scales for the 12 
single-base mismatches categories (A) and the 4 exact matches categories (B). 
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BM1 (BWA): Sammon's mapping on Canberra-SBI 



o 



CVJ 





BT474 
HME 




heart 
liver 




Control-8S.HWI-E4 




MB435 




lymph_node 






MCF7 




mixed_brain 






T47D 




mixed_human_cell_lines 


Control-6S.HWI-E4 




adipose 




skeletal 


muscle 






brain 




testes 








breast 




cerebellum 






colon 












• 








Control-35.SNPSTER2 




• 












• 








Control-40.SNPSTER2 




• 










• 


• 

• • 
• 




Control-2S.SNPSTER 
« Control-2S.SNPSTER2 * 

• Control-17.SNPSTER2 i 

• Control-40.SNPSTER 




v;. 






• Control-17.SNPSTER 
• 








• 


• • • 


Control-35.SNPSTER 






• 




• 


• 








V. 

• 

• 

\ 





-4 -2 2 4 



SAM1 



Figure 8. Sammon map of BMl samples (alignment Al) using Canberra distance between 
SBIs: cerebellum samples are highlighted. 
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BM1 (bowtie): Sammon's mapping on Canberra-SBI 
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Figure 9. Sammon map of BMl samples (alignment A2) using Canberra distance between 
SBIs: cerebellum samples are highlighted. 
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BM1 (TopHat): Sammon's mapping on Canberra-SBI 
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Figure 10. Sammon map of BMl samples (alignment A3) using Canberra distance 
between SBIs: cerebellum samples are highlighted. 
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Table 4. Alignment A3: ranking of the samples of each class in decreasing MCC order 



Class Rank 

heart l,2,3,4,5,6,9,iJ^ 

brain 7,8,10,11,12,13,14, JO<S 

liver 15,16,17,18,19,22,23,iii 

skeletal musele 20,21,24,25,26,28,32,^^5 

mixed brain 27,31,38,40,^0^ 

testes 29,37,45,49,60,61,69,iJ5 

cerebellum 30,33,34,35,39,82,85,92,93,103,i^?6,-/07, 

BT474 36,41,43,48,52,56,59,ii5 

HME 42,44,46,47,51,54,70,JJ7 

lymph node 50,53,63,75,79,80,86, J J6 

colon 55,57,64,66,67,68,72, J^J 

adipose 58,62,65, 71,73,77,88,^^^ 

mixed human cell lines 74,83,87,^^75 

MB435 76,78,81,84,ig0 

MCF7 89,90,91,95, ii5 

breast 94,97,99,100, ii^ 

T47D 96,98,101, 102, 



Italic: merged lanes virtual sample. 
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Table 5. Alignment Al: elements (classwise) of the 24 clusters emerging by cutting the 
whole dendrogram at height 2. 



Cluster 


Elements 


1 


Brain 


2 


Heart, Liver 


3 


1 Mixed brain 


4 


3 Mixed brain 


5 


1 Breast, 3 MCF7, T47D, 5 Testes 


6 


1 Cerebellum 


7 


3 Cerebellum 


8 


5 Lymph node, 1 Testes 


9 


5 Colon, Adipose 


10 


3 Breast, 1 MCF7, 2 Lymph Node 


11 


5 HME, 2 Colon, Skeletal muscle 


12 


Mixed human cell lines 


13 


BT474, 2 HME 


14 


1 Testes, MB435 


15 


Mixed brain, Mixed human cell lines 


16 


Liver, Heart 


17 


1 Cerebellum 


18 


2 Cerebellum, 1 Cerebellum 


19 


2 Cerebellum, 1 Cerebellum 


20 


MB435, Testes 


21 


1 Cerebellum, Breast, Brain 


22 


1 Cerebellum, MCF7, BT474, T47D 


23 


Skeletal muscle 


24 


Lymph node, HME, Colon, Adipose 



Bold: all elements of the elass are included in the cluster; Italic: cluster includes the merged lanes 
virtual sample. 
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Table 6. Alignment A2: elements (classwise) of the 23 clusters emerging by cutting the 
whole dendrogram at height 2. 



Cluster 


Elements 


1 


3 Mixed brain 


2 


Heart, Liver 


3 


1 Mixed brain 


4 


Brain 


5 


BT474 


6 


5 HME, 1 Adipose, 2 Colon, Skeletal muscle 


7 


1 Breast, Mixed human cell lines, 5 Lymph node, 1 Testes 


8 


3 Breast, 1 MCF7, T47D, 2 Lymph node, 1 Testes 


9 


2 HME, 6 Adipose, 5 Colon 


10 


1 Cerebellum 


11 


3 Cerebellum 


12 


1 Cerebellum 


13 


Mixed brain, 2 Cerebellum, 1 Cerebellum 


14 


1 Testes, MB435 


15 


3 MCF7, 4 Testes 


16 


Liver, Heart 


17 


Brain, Mixed human cell lines 


18 


1 Cerebellum 


19 


1 Cerebellum, Breast, T47D, BT474 


20 


MB435, Testes 


21 


2 Cerebellum, 2 Cerebellum 


22 


MCF7, Lymph node 


23 


1 HME, 1 Colon, 1 Adipose, 1 Skeletal muscle 



Bold: all elements of the class are included in the cluster; Italic: cluster includes the merged lanes 
virtual sample. 
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Table 7. Alignment A3: elements (classwise) of the 32 clusters emerging by cutting the 
whole dendrogram at height 2. 



Cluster 


Elements 


1 


Brain 


2 


Skeletal muscle 


3 


1 Heart 


4 


5 Heart, 1 Liver 


5 


1 Heart, 6 Liver 


6 


1 Mixed brain 


7 


3 Mixed brain 


8 


5 BT474, 6 HME 


9 


1 Testes 


10 


Colon, 2 BT474, Adipose, 1 HME 


11 


4 Breast, T47D 


12 


3 MCF7 


13 


MB435, 1 MCF7 


14 


1 Testes, 1 Lymph node 


15 


6 Lymph node 


16 


Mixed human cell lines 


17 


5 Testes 


18 


2 Cerebellum 


19 


2 Cerebellum 


20 


2 Cerebellum 


21 


1 Cerebellum 


22 


1 Cerebellum 


23 


2 Cerebellum 


24 


Testes 


25 


2 Cerebellum 


26 


Liver, Heart 


27 


Mixed human cell lines 


28 


Breast, MCF7, Lymph node, T47D, MB435 


29 


BT474, HME, Colon, Adipose 


30 


2 Cerebellum 


31 


Brain 


32 


Skeletal muscle 



Bold: all elements of the class arc included in the cluster; Italic: cluster includes the merged lanes 
virtual sample. 



